Over the past few years, developing a broad, universal, and general-purpose computer vision system has become a hot topic. A powerful universal system would be capable of solving diverse vision tasks simultaneously without being restricted to a specific problem or a specific data domain, which is of great importance in practical real-world computer vision applications. This study pushes the direction forward by concentrating on the million-scale multi-domain universal object detection problem. The problem is not trivial due to its complicated nature in terms of cross-dataset category label duplication, label conflicts, and the hierarchical taxonomy handling. Moreover, what is the resource-efficient way to utilize emerging large pre-trained vision models for million-scale cross-dataset object detection remains an open challenge. This paper tries to address these challenges by introducing our practices in label handling, hierarchy-aware loss design and resource-efficient model training with a pre-trained large model. Our method is ranked second in the object detection track of Robust Vision Challenge 2022 (RVC 2022). We hope our detailed study would serve as an alternative practice paradigm for similar problems in the community. The code is available at https://github.com/linfeng93/Large-UniDet.
translated by 谷歌翻译
Recognition of facial expression is a challenge when it comes to computer vision. The primary reasons are class imbalance due to data collection and uncertainty due to inherent noise such as fuzzy facial expressions and inconsistent labels. However, current research has focused either on the problem of class imbalance or on the problem of uncertainty, ignoring the intersection of how to address these two problems. Therefore, in this paper, we propose a framework based on Resnet and Attention to solve the above problems. We design weight for each class. Through the penalty mechanism, our model will pay more attention to the learning of small samples during training, and the resulting decrease in model accuracy can be improved by a Convolutional Block Attention Module (CBAM). Meanwhile, our backbone network will also learn an uncertain feature for each sample. By mixing uncertain features between samples, the model can better learn those features that can be used for classification, thus suppressing uncertainty. Experiments show that our method surpasses most basic methods in terms of accuracy on facial expression data sets (e.g., AffectNet, RAF-DB), and it also solves the problem of class imbalance well.
translated by 谷歌翻译
Attention-based arbitrary style transfer studies have shown promising performance in synthesizing vivid local style details. They typically use the all-to-all attention mechanism: each position of content features is fully matched to all positions of style features. However, all-to-all attention tends to generate distorted style patterns and has quadratic complexity. It virtually limits both the effectiveness and efficiency of arbitrary style transfer. In this paper, we rethink what kind of attention mechanism is more appropriate for arbitrary style transfer. Our answer is a novel all-to-key attention mechanism: each position of content features is matched to key positions of style features. Specifically, it integrates two newly proposed attention forms: distributed and progressive attention. Distributed attention assigns attention to multiple key positions; Progressive attention pays attention from coarse to fine. All-to-key attention promotes the matching of diverse and reasonable style patterns and has linear complexity. The resultant module, dubbed StyA2K, has fine properties in rendering reasonable style textures and maintaining consistent local structure. Qualitative and quantitative experiments demonstrate that our method achieves superior results than state-of-the-art approaches.
translated by 谷歌翻译
Recently, a surge of high-quality 3D-aware GANs have been proposed, which leverage the generative power of neural rendering. It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred as 3D GAN inversion. Although with the facial prior preserved in pre-trained 3D GANs, reconstructing a 3D portrait with only one monocular image is still an ill-pose problem. The straightforward application of 2D GAN inversion methods focuses on texture similarity only while ignoring the correctness of 3D geometry shapes. It may raise geometry collapse effects, especially when reconstructing a side face under an extreme pose. Besides, the synthetic results in novel views are prone to be blurry. In this work, we propose a novel method to promote 3D GAN inversion by introducing facial symmetry prior. We design a pipeline and constraints to make full use of the pseudo auxiliary view obtained via image flipping, which helps obtain a robust and reasonable geometry shape during the inversion process. To enhance texture fidelity in unobserved viewpoints, pseudo labels from depth-guided 3D warping can provide extra supervision. We design constraints aimed at filtering out conflict areas for optimization in asymmetric situations. Comprehensive quantitative and qualitative evaluations on image reconstruction and editing demonstrate the superiority of our method.
translated by 谷歌翻译
As an important data selection schema, active learning emerges as the essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance of deep neural network based models, which are composed of a large number of parameters and data hungry, in application. Despite its indispensable role for developing AI models, research on active learning is not as intensive as other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate democratization of AI technologies by boosting model production at scale.
translated by 谷歌翻译
One-shot segmentation of brain tissues is typically a dual-model iterative learning: a registration model (reg-model) warps a carefully-labeled atlas onto unlabeled images to initialize their pseudo masks for training a segmentation model (seg-model); the seg-model revises the pseudo masks to enhance the reg-model for a better warping in the next iteration. However, there is a key weakness in such dual-model iteration that the spatial misalignment inevitably caused by the reg-model could misguide the seg-model, which makes it converge on an inferior segmentation performance eventually. In this paper, we propose a novel image-aligned style transformation to reinforce the dual-model iterative learning for robust one-shot segmentation of brain tissues. Specifically, we first utilize the reg-model to warp the atlas onto an unlabeled image, and then employ the Fourier-based amplitude exchange with perturbation to transplant the style of the unlabeled image into the aligned atlas. This allows the subsequent seg-model to learn on the aligned and style-transferred copies of the atlas instead of unlabeled images, which naturally guarantees the correct spatial correspondence of an image-mask training pair, without sacrificing the diversity of intensity patterns carried by the unlabeled images. Furthermore, we introduce a feature-aware content consistency in addition to the image-level similarity to constrain the reg-model for a promising initialization, which avoids the collapse of image-aligned style transformation in the first iteration. Experimental results on two public datasets demonstrate 1) a competitive segmentation performance of our method compared to the fully-supervised method, and 2) a superior performance over other state-of-the-art with an increase of average Dice by up to 4.67%. The source code is available at: https://github.com/JinxLv/One-shot-segmentation-via-IST.
translated by 谷歌翻译
幼稚的贝叶斯在许多应用中广泛使用,因为它具有简单性和处理数值数据和分类数据的能力。但是,缺乏特征之间的相关性建模会限制其性能。此外,现实世界数据集中的噪声和离群值也大大降低了分类性能。在本文中,我们提出了一种功能增强方法,该方法采用堆栈自动编码器来减少数据中的噪声并增强幼稚贝叶斯的判别能力。提出的堆栈自动编码器由两个用于不同目的的自动编码器组成。第一个编码器缩小了初始特征,以得出紧凑的特征表示,以消除噪声和冗余信息。第二个编码器通过将功能扩展到更高维度的空间中来增强特征的判别能力,从而使不同类别的样品在较高维度的空间中可以更好地分离。通过将提出的功能增强方法与正规化的幼稚贝叶斯集成,该模型的歧视能力得到了极大的增强。在一组机器学习基准数据集上评估所提出的方法。实验结果表明,所提出的方法显着且始终如一地优于最先进的天真贝叶斯分类器。
translated by 谷歌翻译
3D多对象跟踪(MOT)确保在连续动态检测过程中保持一致性,有利于自动驾驶中随后的运动计划和导航任务。但是,基于摄像头的方法在闭塞情况下受到影响,准确跟踪基于激光雷达的方法的对象的不规则运动可能是具有挑战性的。某些融合方法效果很好,但不认为在遮挡下出现外观特征的不可信问题。同时,错误检测问题也显着影响跟踪。因此,我们根据组合的外观运动优化(Camo-Mot)提出了一种新颖的相机融合3D MOT框架,该框架使用相机和激光镜数据,并大大减少了由遮挡和错误检测引起的跟踪故障。对于遮挡问题,我们是第一个提出遮挡头来有效地选择最佳对象外观的人,从而减少了闭塞的影响。为了减少错误检测在跟踪中的影响,我们根据置信得分设计一个运动成本矩阵,从而提高了3D空间中的定位和对象预测准确性。由于现有的多目标跟踪方法仅考虑一个类别,因此我们还建议建立多类损失,以在多类别场景中实现多目标跟踪。在Kitti和Nuscenes跟踪基准测试上进行了一系列验证实验。我们提出的方法在KITTI测试数据集上的所有多模式MOT方法中实现了最先进的性能和最低的身份开关(IDS)值(CAR为23,行人为137)。并且我们提出的方法在Nuscenes测试数据集上以75.3%的AMOTA进行了所有算法中的最新性能。
translated by 谷歌翻译
组织病理学图像包含丰富的表型信息和病理模式,这是疾病诊断的黄金标准,对于预测患者预后和治疗结果至关重要。近年来,在临床实践中迫切需要针对组织病理学图像的计算机自动化分析技术,而卷积神经网络代表的深度学习方法已逐渐成为数字病理领域的主流。但是,在该领域获得大量细粒的注释数据是一项非常昂贵且艰巨的任务,这阻碍了基于大量注释数据的传统监督算法的进一步开发。最新的研究开始从传统的监督范式中解放出来,最有代表性的研究是基于弱注释,基于有限的注释的半监督学习范式以及基于自我监督的学习范式的弱监督学习范式的研究图像表示学习。这些新方法引发了针对注释效率的新自动病理图像诊断和分析。通过对130篇论文的调查,我们对从技术和方法论的角度来看,对计算病理学领域中有关弱监督学习,半监督学习以及自我监督学习的最新研究进行了全面的系统综述。最后,我们提出了这些技术的关键挑战和未来趋势。
translated by 谷歌翻译
我们引入了一种新型的数学公式,用于训练以(可能非平滑)近端图作为激活函数的馈送前向神经网络的培训。该公式基于布雷格曼的距离,关键优势是其相对于网络参数的部分导数不需要计算网络激活函数的导数。我们没有使用一阶优化方法和后传播的组合估算参数(如最先进的),而是建议使用非平滑一阶优化方法来利用特定结构新颖的表述。我们提出了几个数值结果,这些结果表明,与更常规的培训框架相比,这些训练方法可以很好地很好地适合于培训基于神经网络的分类器和具有稀疏编码的(DeNoising)自动编码器。
translated by 谷歌翻译